knitr::include_graphics("Picture1.jpg")
The housing crisis in UC Santa Barbara is a growing problem. Many students have difficulty seeking affordable housing. The average rent for a 1 bedroom apartment in UCSB, CA is currently 2798. This is a 61% increase compared to the previous year. The university plans to construct Muger Hall to accommodate more students and fulfill obligations mentioned in the 2010 Long Range Development Plan. However, the design of Muger Hall, which is windowless, caused many people controversy and opposition. Many media report this plan, and below is one of the reports. The housing issues in my school arise my interest in housing prices. After searching on the internet, I find one data set for California housing prices, and I could utilize the method learned in class to find out which variable relate to the housing price and employ different machine learning techniques to make a prediction of housing price.
library("vembedr")
embed_url("https://www.youtube.com/watch?v=9ZJSaqOWOqI")
The main goal of this project is utilizing different machine learning models to predict median_house_value, and find the best-performance model. I may utilize this model to predict future housing price based on the newest data. Also, I plan to do some data visualization to find relationship among different variable especially median house value.
Loading Data
# read in data
housing <- read.csv("housing.csv")
head(housing)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.3252 452600 NEAR BAY
## 2 1138 8.3014 358500 NEAR BAY
## 3 177 7.2574 352100 NEAR BAY
## 4 219 5.6431 341300 NEAR BAY
## 5 259 3.8462 342200 NEAR BAY
## 6 193 4.0368 269700 NEAR BAY
The data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data. The data was downloaded from Kaggle Link: https://www.kaggle.com/datasets/camnugent/california-housing-prices Source:This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data.
Below are description of the variables in the data set and code book is also available on my github page.
Loading Package
Below are packages I plan to use in this project
# load packages
library(tidyverse)
library(lubridate)
library(tidymodels)
library(janitor)
library(naniar)
library(corrr)
library(gridExtra)
library(randomForest)
library(kknn)
The data cleaning part is organized by the following step 1. Check missing value 2. Processing categorical varialbe
Check Missing Value
In this part, we plan to use vis_miss function in naniar package to visualizing the missing data +The function vis_miss provides a summary of whether the data is missing or not. +It also provides the amount of missings in each columns.
# use vis_miss function in naniar package to visualize the missing value
vis_miss(housing)
I found only one variable total_bedrooms have missing value.
# check the number of missing value
total_bedrooms <- housing$total_bedrooms
summary(total_bedrooms)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 296.0 435.0 537.9 647.0 6445.0 207
The total_bedrooms has 207 NA values. I consider using median of total_bedrooms to impute missing values.
# using median of total_bedrooms to impute missing value
housing <- housing %>% mutate(total_bedrooms = ifelse(is.na(total_bedrooms),
median(total_bedrooms, na.rm = T),
total_bedrooms))
Check if there is any missing value
# Check if there is any missing value
sum(is.na(housing))
## [1] 0
After checking, we do not have any missing value in our data set.
Summary the whole data set
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 297.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 536.8 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 643.2 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
Categorical Variable
I notice that ocean_proximity is character variable and I want to change that to factor variable
# change and show the levels of ocean_proximity
housing$ocean_proximity <- as.factor(housing$ocean_proximity)
levels(housing$ocean_proximity)
## [1] "<1H OCEAN" "INLAND" "ISLAND" "NEAR BAY" "NEAR OCEAN"
Ocean_proximity has 5 levels to illustrate location of the house with respect to ocean/sea.
I plan to visualize the median house value based on house location (x = longitude, y = latitude). The dot shows the population density and color represents the median house value
map_view <- housing %>%
ggplot(aes(x = longitude, y = latitude, color = median_house_value)) +
geom_point(aes(size = population), alpha = 0.5) +
scale_color_distiller(palette = "Spectral") +
xlab("Longitdue") +
ylab("Latitude") +
labs(title = "Median House Value Visualization", colour = "Median House Value", size = "Population")+
theme_minimal()
map_view
By observing the Median House Value with respect to the location, I find that houses close to the ocean/sea have a higher median house value compared with inland houses. One reasonable explanation could be most metropolitan (LA Bay Area) is close to the ocean/sea. A large portion of the population lives in these areas, which may cause great demand for housing and pull up the median house value. Another possible reason could also be from the demand side. The houses close to the ocean/sea are more attractive with better living environments. Also interesting to note that the population density and median housing value inland are much smaller. Therefore, ocean_proximity may be an important predictor of median housing value, and I will focus on ocean_proximity in the following exploratory data analysis.
I plan to draw histogram for ocean_proximity
housing %>%
ggplot(aes(x = factor(ocean_proximity))) +
geom_histogram(stat = 'count', fill = "blue1") +
geom_text(aes(label = ..count..), stat = "count", vjust = -0.5, colour = "black") +
xlab("Ocean_proximity")+
ylab("Count")+
theme_minimal()
By observing count of each level of Ocean_proximity, I find the observation of ISLAND is much less than other levels. ISLAND only have 5 observations and other levels have over than 2000 observations. Therefore, we may consider to remove level of ISLAND in model fitting, which may have subtle impact on our final model performance.
housing[housing$ocean_proximity != "ISLAND", ] %>%
ggplot(aes(factor(ocean_proximity), median_house_value)) +
geom_boxplot()+
xlab("Ocean_proximity")+
ylab("Median house value")+
theme_minimal()
The boxplot above can also show housings close to sea/ocean have higher median house value. Another interesting fact is that the minimum median housing value near the bay/ocean is larger than the maximum median housing value inland. From the analysis and visualization above, we could know location plays an important role in house value.
In this part, I use correlation plot to show the relationship of different variables in our data set
cor_housing <- housing %>%
select(-c(longitude,latitude,total_bedrooms)) %>%
select(is.numeric) %>%
correlate()
cor_housing %>%
stretch() %>%
ggplot(aes(x,y,fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r))))
In the correlation plot above, I could know median_income and
median_house_value have a positive correlation (0.69). This conforms to
everyday perceptions: family with high median_income could afford more
expensive house. Therefore, median income could be our second important
variable to predict median_house_value.
Some other correlation listed below:
Graph below are density estimation of median_housing value and income distribution and red line represents mean and blue line represents median.
income <- housing %>% ggplot(aes(x = median_income)) +
geom_density(fill = 'blue', alpha = 0.7) +
geom_vline(aes(xintercept = mean(median_income)),
color = "firebrick1", linetype = "dashed") +
geom_vline(aes(xintercept = median(median_income)),
color = "blue2", linetype = "dashed") +
xlab("Median Income") +
labs(title = "Density estimation of median income") +
theme_minimal()
housing_value <- housing %>% ggplot(aes(x = median_house_value)) +
geom_density(fill = 'cadetblue1', alpha = 0.7) +
geom_vline(aes(xintercept = mean(median_house_value)),
color = "firebrick1", linetype = "dashed") +
geom_vline(aes(xintercept = median(median_house_value)),
color = "blue2", linetype = "dashed") +
xlab("Median House Value") +
labs(title = "Density estimation of median house value") +
theme_minimal()
grid.arrange(income, housing_value, ncol=2)
The distribution of Median Income and Median House Value seems both
right-skewed with mean value larger than the median value. This may
indicate the people’s median income match people’s median house value.
One interest point is that in the density estimation of median house
value there is one increasing tail. This may show only small portion of
individuals with high median income own these luxuray housing.
The data was split in a 80% training, and 20% testing. We also strata median_house_value, which make sure that both side of the split have roughly the same distribution for each value of strata.
# set seed so our result is reproducible
set.seed(3435)
housing_split <- housing %>%
initial_split(prop = 0.8, strata = "median_house_value")
housing_train <- training(housing_split)
housing_test <- testing(housing_split)
# checking dimension to verify that the correct number of observations are now in each data set
c(dim(housing_train), dim(housing_test))
## [1] 16510 10 4130 10
The training data set has 16510 observations, and test data set has 4130 observations. They both have desired number of observations.
v-fold cross-validation
Next, use v-fold cross-validation on the training set. Use 5 folds. Stratify the folds by median_house_value as well.
housing_folds <- vfold_cv(housing_train, v = 5, strata = median_house_value)
This part is organized by the following step: 1. Building the model 2. Running the model 3. Analyzing the model
Set up a recipe to predict median_house_value with longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, ocean_proximity.
housing_recipe <- recipe(median_house_value ~
longitude + latitude + housing_median_age
+ total_rooms + total_bedrooms
+ population + households
+ median_income + ocean_proximity, data = housing_train) %>%
step_dummy(ocean_proximity) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
To begin, I use linear_reg() and mixture = 0 to specify a ridge model, and use the glmnet as engine. Because our prediction is numeric, so we set our mode as regression. Also, we set penalty = tune(). This tells tune_gune grid that the penalty parameter should be tuned.
ridge_spec <-
linear_reg(penalty = tune(), mixture = 0) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_workflow <- workflow() %>%
add_recipe(housing_recipe) %>%
add_model(ridge_spec)
Next, I set up the tuning grid with range from -5 to 5 and levels equal to 20
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 20)
Finally, I executed my model by tuning and fitting.
ridge_res <- tune_grid(
ridge_workflow,
resamples = housing_folds,
grid = penalty_grid
)
Here, I plan to set random forest model. I plan to tune mtry and min_n, set mode to “regression”, and use the ranger engine. I stored this model and my recipe in a workflow.
rf_model <-
rand_forest(
mtry = tune(),
min_n = tune(),
mode = "regression") %>%
set_engine("ranger")
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(housing_recipe)
Next, I plan to set up the tuning grid. I set mtry ranges from 1 to 10, when we include all the predictors the model is bagging. The min_n ranges from 100 to 200. Also, considering the computation power, I set levels equal to 7.
rf_grid <- grid_regular(mtry(range = c(1,10)),
min_n(range = c(100,200)),
levels = 7)
Finally, I executed my model by tuning and fitting, and save the final result for future model comparsion.
rf_tune <- tune_grid(
rf_workflow,
resamples = housing_folds,
grid = rf_grid,
metrics = metric_set(rmse)
)
save(rf_tune, rf_workflow, file = "rf_tune.rda")
Similarly, I set the boosted tree model and tune parameters mtry and learn_rate. I set engine as xgboost and create a workflow.
bt_model <- boost_tree(mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost")
bt_workflow <- workflow() %>%
add_model(bt_model) %>%
add_recipe(housing_recipe)
Next, I set up a tuning grid with the similar process like random forest model.
bt_grid <- grid_regular(mtry(range = c(2,9)),
learn_rate(range = c(0.05,0.3)),
levels = 5)
Finally, I executed the model and save the final result for future model comparison
bt_tune <- tune_grid(
bt_workflow,
resamples = housing_folds,
grid = bt_grid,
metrics = metric_set(rmse)
)
save(bt_tune, bt_workflow, file = "bt_tune.rda")
In this part, I set Nearest Neighbors model. I plan to tune parameter neighbors and set model as regression and engine as kknn
knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "regression"
) %>% set_engine("kknn")
knn_workflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(housing_recipe)
Next, I set up the tuning grid
knn_grid <- grid_regular(neighbors(range = c(1,10)),
levels = 5)
Finally, I executed the model and save the final result for future model comparison
knn_tune <- tune_grid(
knn_workflow,
grid = knn_grid,
resamples = housing_folds,
metrics = metric_set(rmse)
)
save(knn_tune, knn_workflow, file = "knn_tune.rda")
load("rf_tune.rda")
load("bt_tune.rda")
load("knn_tune.rda")
we utilize autoplot to create a visualization of our result
autoplot(ridge_res)
We see that the amount of regularization affects the performance metrics
differently.
Here, we use select_by_one_std_err() which select the most simple model that is within in one standard error of the numerically optimal results.
best_penalty <- select_by_one_std_err(ridge_res, penalty, metric = "rmse")
best_penalty
## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config .best .bound
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 0.00001 rmse standard 70066. 5 485. Preprocessor1_M… 70066. 70551.
We could use training data set to get the final train result and draw ggplot to know model performance
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = housing_train)
housing_train_res <- predict(ridge_final_fit, new_data = housing_train %>% select(-median_house_value))
housing_train_res %>%
ggplot(aes(x = .pred, y = housing_train$median_house_value)) +
geom_point(alpha = 0.2) +
geom_abline(lty = 2) +
theme_bw() +
coord_obs_pred()
we use autoplot function to visualize our model
autoplot(rf_tune, metric = "rmse")
Next, we also use select_by_one_std_err() which select the most simple model
select_by_one_std_err(rf_tune,mtry,metric = "rmse")
## # A tibble: 1 × 10
## mtry min_n .metric .estimator mean n std_err .config .best .bound
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 5 100 rmse standard 54083. 5 516. Preprocesso… 53779. 54246.
Use autoplot to visualize the model
autoplot(bt_tune, metric = "rmse")
Next, we also use select_by_one_std_err() to select the model
select_by_one_std_err(bt_tune, mtry, metric = "rmse")
## # A tibble: 1 × 10
## mtry learn_rate .metric .estimator mean n std_err .config .best .bound
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 7 1.12 rmse standard 60677. 5 721. Prepro… 60677. 61398.
Use autoplot to visualize our model
autoplot(knn_tune, metric = "rmse")
Next, we also use select_by_one_std_err() to select the model
select_by_one_std_err(knn_tune, neighbors , metric = "rmse")
## # A tibble: 1 × 9
## neighbors .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 10 rmse standard 60466. 5 504. Preprocessor1… 60466. 60970.
After comparing the model I used above, Random Forest has the best performance. Therefore, I decide to use Random Forest as our Final Model.
rf_workflow_tuned <- rf_workflow %>%
finalize_workflow(select_by_one_std_err(rf_tune,mtry,metric = "rmse"))
Run the fit and then check our model performance
rf_results <- fit(rf_workflow_tuned, housing_train)
Now, lets fit the model to the testing data set and check our model performance
housing_metric <- metric_set(rmse)
model_test_predictions <- predict(rf_results, new_data = housing_test) %>%
bind_cols(housing_test %>% select(median_house_value))
model_test_predictions %>%
housing_metric(truth = median_house_value, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 52598.
Our model’s rmse is 53378 on our testing data set, which is close to the rmse on the training data set Thus, we do not have an overfitting issue and my model have a great performance
The housing crisis at UCSB sparks my interest in exploring California housing prices. In this project, I utilized different plots and graphs to represent the correlation between each variable. The focus of this project is using a cross-validation training set to train our machine learning model. Using these models, we could make predictions about housing median values. Through building and analyzing model performance, we find Random Forest has the best performance. Therefore, we employ Random Forest to fit our test set. The final result shows I have an acceptable Rmse value, which is also close to our training set. Our model has a great performance in predicting housing median values.
However, there are still several limitations in my project. The data set is out of date, and I may find the most recent data to train my model to get reliable results. Next, I do not include other machine learning models, which may have better performance. I will try some other models when I learn more about machine learning.
This quarter goes so fast. I really appreciate the effort made by our instructor Prof. Coburn and teaching assistant Hanmo Li to help me to explore this interesting topic. I enjoy the process of finishing this project and hope I could learn more to improve my project. Thanks again for people helping me this quarter.